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Abstract 

We introduce a stochastic process with Wishart marginals: the generalised Wishart process 
(GWP). It is a collection of positive semi-definite random matrices indexed by any arbitrary 
dependent variable. We use it to model dynamic (e.g. time varying) covariance matrices £(t). 
Unlike existing models, it can capture a diverse class of covariance structures, it can easily 
handle missing data, the dependent variable can readily include covariates other than time, 
and it scales well with dimension; there is no need for free parameters, and optional parameters 
are easy to interpret. We describe how to construct the GWP, introduce general procedures 
for inference and predictions, and show that it outperforms its main competitor, multivariate 
GARCH, even on financial data that especially suits GARCH. We also show how to predict 
the mean n(t) of a multivariate process while accounting for dynamic correlations. 



1 Introduction 



Imagine the price of the NASDAQ composite index increased dramatically today. Will it continue to rise 
tomorrow? Should you invest? Perhaps this is the beginning of a trend, but it may also be an anomaly. 
Now suppose you discover that every major equity index - FTSE, NIKKEI, TSE, etc. - has also risen. 
Instinctively the rise in NASDAQ was not anomalous, because this market is correlated with other major 
indices. This is an example of how multivariate models which account for correlations can be better than 
univariate models at making univariate predictions. 

In this paper, we are concerned with modelling the dynamic covariance matrix £(f) of high dimensional 



data sets (multivariate volatility). These models are especially important in econometrics. Brownlees et al 



(2009) remark that "The price of essentially every derivative security is affected by swings in volatility. Risk 



management models used by financial institutions and required by regulators take time-varying volatility 
as a key input. Poor appraisal of the risks to come can leave investors excessively exposed to market 
fluctuations or institutions hanging on a precipice of inadequate capital". Indeed, Robert Engle and 
Clive Granger won the 2003 Nobel prize in economics "for methods of analysing economic time series 
with time- varying volatility" . The returns on major equity indices and currency exchanges are thought 
to have a time changing variance and zero mean, and GARCH (Bollerslev 1986), a generalization of 



Englc's ARCH (Engle 1982), is arguably unsurpassed at predicting the volatilities of returns on these 



equity indices and currency exchanges (Poon and Granger 2005 Hansen and Lunde 2005 Brownlees 



et al. 20091. Multivariate volatility models can be used to understand the dynamic correlations (or co- 



movement) between equity indices, and can make better univariate predictions than univariate models. 
A good estimate of the covariance matrix E(i) is also necessary for portfolio management. An optimal 



*http: / /mlg. eng. cam. ac.uk/andrew 



portfolio allocation w* is said to maximise the Sharpe ratio ( Sharpel 1966) 



Portfolio return 



w T r(t) 



Portfolio volatility ^w T T,(t)w ' 



(1) 



where r(t) are expected returns for each asset and is the predicted covariance matrix for these returns 
One may also wish to maximise the portfolio return w T r(t) for a fixed level of volatili ty: ^ w 1 Y,( t)w = 
A. Sharpe, Markowitz and Merton jointly received a Nobel prize for portfolio theory ( Markowitzl 1952 



Merton 1972). Multivariate volatility models are also used to understand contagion: the transmission of 

in econometrics, machine 



a financial shock from one entity to another (Bae et al 
learning, climate science, or otherwise 
entities. 



2003). And generally 



it is useful to know the dynamic correlations between multiple 



Despite their importance, existing multivariate volatility models suffer from tractability issues and a lack 
of generality. For example, multivariate GARCH (MGARCH) has a number of free parameters that scales 
with dimension to the fourth power, and interpretation and estimation of these parameters is difficult to 
impossible ( |Silvennoinen and Terasvirta 2009 Gourieroux 1997), given the constraint that E(i) must be 
positive definite at all points in time. Thus MGARCH, and alternative multivariate stochas tic volatility 
(MSV) models^] are generally limited to studying processes with less than 5 components (Gourieroux 



et al. 2009]). Recent efforts have led to simpler but less general models, which make assumptions such as 
constant correlations (Bollerslev 1990) - leaving only the diagonal entries of £(<) to vary. 



We hope to unite machine learning and econometrics in an effort to solve these problems. We introduce 
a stochastic process with Wishart marginals: the generalised Wishart process (GWP). It is a collection 
of positive semi-definite random matrices indexed by any arbitrary dependent variable z. We call it the 



generalised Wishart process, since it is a generalisation of the first Wishart process defined by Bru ( 1991 ). 



Bru's Wishart process has recently been used ( Gourieroux et al. 2009 ) in multivariate stochastic volatility 



(MSV) models (Philipov and Glickman [2006 



Harvey et al. 



1994). This prior work on Wishart processes 
is limited for several reasons: 1) it assumes the dependent variable is a scalar, 2) it is restricted to using 
an Ornstein-Uhlenbeck covariance structur^] (which means £(£ + a) and T,{t — a) are independent given 
£(£), and complex dependencies cannot be captured), 3) it is autoregressive, and 4) there are no general 
learning and inference procedures. The generalised Wishart process (GWP) addresses all of these issues. 
Specifically, in the GWP formulation, 

• The dependent variable can come from any arbitrary index set, just as easily as it can represent time. 
This allows one to effortlessly condition on covariates like interest rates. 

• One can easily handle missing data. 

• One can easily specify a range of covariance structures (periodic, smooth, Ornstein-Uhlenbeck, . . . ). 

• We develop Bayesian inference procedures to make predictions, and to learn distributions over any 
relevant parameters. Aspects of the covariance structure are learned from data, rather than being a 
fixed property of the model. 

Overall, the GWP is versatile and simple. It does not require any free parameters, and any optional 
parameters are easy to interpret. For this reason, it also scales well with dimension. Yet, the GWP provides 
an especially general description of multivariate volatility - more so than the most general MGARCH 
specifications. In the next section, we review Gaussian processes (GPs), which are used to construct the 
Wishart process. In the following sections we then review the Wishart distribution, present the GWP 
construction, introduce procedures for inference and predictions, review the main competitor, MGARCH, 
and present experiments that show how the GWP outperforms MGARCH on simulated and financial data. 



unlike GARCH which 



1 MSV models, pioneered by |Harvey et ah] | |l994} , assume volatility follows a random process, 

assumes it is a deterministic functio n of the past. 

An Ornstein-Uhlenbeck process jUhlenback and Ornstein| |l930[ > was first introduced to model the velocity of a particle 
undergoing Brownian motion. 
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These experiments include a 5 dimensional data set, based on returns for NASDAQ, FTSE, NIKKEI, TSE, 
and the Dow Jones Composite, and a set of returns for 3 foreign currency exchanges. In a subsequent 
version we will also present a 200 dimensional experiment to show how the GWP can be used to study 
high dimensional problems. 

Also, although it is not the focus of this paper, we show in the inference section how the GWP can 
additionally be used as part of a new GP based regression model that accounts for changing correlations. 
In other words, it can be used to predict the mean fj,(t) together with the covariance matrix £(t) of a 
multivariate process. Alternative GP based multivariate regression models for fi(t), which account for 
fixed correlations, were recently introduced by Bonilla et al. (20081, Teh et al. (2005), and Boyle and Frean 



(2004). 



2 Gaussian Processes 



We briefly review Gaussian processes, since the generalised Wishart process is constructed from GPs. For 



more detail, see Rasmussen and Williams ( 2006 ) 



A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian 
distribution. Using a Gaussian process, we can define a distribution over functions u(z): 



u(z)~gV(m(z),k(z,z')) 



(2) 



where z is an arbitrary (potentially vector valued) dependent variable, and the mean m(z) and kernel 
function k(z, z') are respectively defined as 



m(z) = E[u(z)] , 
k(z, z') — cov(u(z), u{z')) . 

This means that any collection of function values has a joint Gaussian distribution: 

(u(z 1 ),u(z 2 ), u(z N )) T ~ Af(fi, K) , 



(3) 
(4) 

(5) 



where the N x N Gram matrix K has entries — k{z i: z,j), and the mean fi has entries /x; = m(zj). 
The properties of these functions (smoothness, periodicity, etc.) are determined by the kernel function. 
The squared exponential kernel is popular: 



k(z,z') = exp(-0.5||z-z / ||7Z 2 ). 



(6) 



Functions drawn from a Gaussian process with this kernel function are smooth, and can display long range 
trends. The length-scale hyperparameter I is easy to interpret: it determines how much the function values 
u(z) and u(z + a) depend on one another, for some constant a. 

The autoregressive process 



u(f + 1) =u(t)+e(t), 
is an example of a Gaussian process with a fixed covariance structure. 



(7) 

(8) 



3 Wishart Distribution 



The Wishart distribution defines a probability density function over positive definite matrices S: 

\ S \{v-D-l)/2 j 



(9) 
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where V is a D x D positive definite scale matrix, and v > is the number of degrees of freedom. This 
distribution has mean vV and mode (D — v — 1)V for v > D + l. Trj(-) is the multivariate gamma function: 

D 

T D (u/2) = 7T D ( D - 1 )/ 4 Y[ T ( v /2 + (1 - j)/2) . (10) 

3=1 

The Wishart distribution is a multivariate generalisation of the Gamma distribution when v is real valued, 
and the chi-square (x 2 ) distribution when v is integer valued. The sum of squares of univariate Gaussian 
random variables is chi-squared distributed. Likewise, the sum of outer products of multivariate Gaussian 
random variables is Wishart distributed: 

v 

5 = ^m,m7 '~W d (V,v), (11) 
i=i 

where the Ui are i.i.d. Af(0, V) Z?-dimensional random variables, and Wd(V, v) is a Wishart distribution 
with D x D scale matrix V, and v degrees of freedom. S is a D x D positive definite matrix. If D = V — 1 
then W is a chi-square distribution with v degrees of freedom. S^ 1 has the inverse Wishart distribution, 
W^y- 1 ,!/), which is a conjugate prior for covariance matrices of zero mean Gaussian distributions. This 
means that for data D if a prior p(R) is inverse Wishart, and the likelihood p(T>\R) is Gaussian with zero 
mean, then the posterior p(R\T>) is also inverse Wishart. 



4 Generalised Wishart Process Construction 

We saw that the Wishart distribution is constructed from multivariate Gaussian distributions. Essentially, 
by replacing these Gaussian distributions with Gaussian processes, we define a process with Wishart 
marginals - the generalised Wishart process. It is a collection of positive semi-definite random matrices 
indexed by any arbitrary (potentially high dimensional) dependent variable z. For clarity, we assume that 
time is the dependent variable, even though it takes no more effort to use a vector-valued variable z from 
any arbitrary set. Everything we write would still apply if we replaced t with z. 

Suppose we have vD independent Gaussian process functions, Ui d {t) ~ QV(0,k), where i — l,...,p 
and d = 1,...,D. This means cov(u ld (t), u id {t')) = k(t,i!)8 i i/8 d , d - , and (u id (ti), u id (t 2 ), . . . , u id (t N )) T ~ 
A/"(0, K), where 8%j is the Kronecker delta, and K is an N x N Gram matrix with elements Kij = k(ti, tj). 
Let iii(t) = (un(t), . . . ,un)(t)) T , and let L be the lower Cholesky decomposition of a D x D scale matrix 
V, such that LL T — V. Then at each t the covariance matrix T,(t) has a Wishart marginal distribution, 



V 

E(f) - ]T Ltii(t)uJ (t)L T ~ W D (V, v) , (12) 



i=i 



subject to the constraint that the kernel function k(t,t) = 1. 



We can understand (12 1 as follows. Each element of the vector iii(t) is a univariate Gaussian with 
zero mean and variance k(t,t) — 1. Since these elements are uncorrelated, Ui(t) ~ Af(0,I). Therefore 
Lui(t) ~ jV(0, V), since E[Lu,i(i)Mi(<) T L T ] = LIL T = LL T = V. We are summing the outer products of 
Af(0, V) random variables, and there are v terms in the sum, so by definition this has a Wishart distribution 
Wd(V,v). 

We write £(t) ~ GWV(V, v, k(t, t')) to mean that £(t) is a collection of positive semi-definite random 
matrices with Wd(V,u) marginal distributions. Assuming the dependent variable is time, a draw from a 
Wishart process is a collection of matrices indexed by time (Figure [I]) , much like a draw from a Gaussian 
process is a collection of function values indexed by time. 
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Figure 1: A draw from a generalised Wishart process (GWP). Each ellipse is a 2 x 2 covariance matrix indexed by 
time, which increases from left to right. The rotation indicates the correlation between the two variables, and the 
major and minor axes scale with the eigenvalues of the matrix. Like a draw from a Gaussian process is a collection 
of function values indexed by time, a draw from a GWP is a collection of matrices indexed by time. 



Using this construction, we can also define a generalised inverse Wishart process (GIWP). If E(i) ~ GWP, 
then inversion at each value of t defines a draw R(t) — £(i) _1 from the GIWP. The conjugacy of the GIWP 
with a Gaussian likelihood could be useful when doing Bayesian inference. 



We can further extend this construction by replacing the Gaussian processes with copula processes (Wilson 



and Ghahramani 20101. For example, as part of Bayesian inference we could learn a mapping that would 



transform the Gaussian processes Uid to Gaussian copula processes with marginals that better suit the 
covariance structure of our data set; the result is a Wishart copula process. 

The formulation we outlined in this section is different from other multivariate volatility models in that 
one can specify a kernel function k(t,t') that controls how E(£) varies with t - for example, k(t,t') could 
be periodic - and t need not be time: it can be an arbitrary dependent variable, including covariates like 
interest rates. In the next section we introduce, for the first time, general inference procedures for making 
predictions when using a Wishart process prior. These are based on recently developed Markov chain 



Monte Carlo techniques (Murray et al. 2010). We also introduce a new method for doing multivariate GP 



based regression with dynamic correlations. 



5 Bayesian Inference 

Assume we have a generalised Wishart process prior on a dynamic D x D covariance matrix: 

E(j;)~gWP{V,v,k). (13) 

We want to infer the posterior E(i) given a D-dimensional data set V = {x(t n ) : n = 1, . . . ,N}. We 
explain how to do this for a general likelihood function, p(D|E(t)), by finding the posterior distributions 
over the parameters in the model, given the data T>. These parameters are: a vector of all relevant GP 
function values it, the hyperparameters of the GP kernel function 6, the degrees of freedom v, and L, the 
lower cholesky decomposition of the scale matrix V (LL T — V). The graphical model in Figure [2] shows 
all the relevant parameters and conditional dependence relationships. 



We can sample from these posterior distributions using Gibbs sampling (Geman and Geman 1984), a 
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Figure 2: Graphical model of the generalised Wishart process, u is a vector of GP function values, are GP 
hyperparameters, L is the lower Cholesky decomposition of the scale matrix (LL T = V), v are the degrees of 
freedom, and E is the covariance matrix. 



Markov chain Monte Carlo algorithm where initialising {it, 6, L, v\ and then sampling in cycles from 

p{u\6, L, v, V) oc p(V\u, L, v)p(u\0) , (14) 



p(P\u,L,v,V)<xp(u\0)p(0), 
p(L\0, u, v, V) oc p(V\u, L, v)p(L) , 
p{v\9, u, L, V) oc p(V\u, L, v)p(y) , 



(15) 
(16) 
(17) 



will converge to samples from p(u, 6, L, f\T>) 
posterior distributions 



14 , 15 



We will successively describe how to sample from the 
(|16|), and (1 1 Th . In our discussion we assume there are N data points 



(one at each time step or input), and D dimensions. We then explain how to make predictions of £(£*) 
at some test input i* . Finally, we discuss a potential likelihood function, and how the GWP could also be 
used as a new GP based model for multivariate regression with outputs that have changing correlations. 



5.1 Sampling the GP functions 



In this section we describe how to sample from the posterior distribution ( 14 ) over the Gaussian process 



function values u. We order the entries of u by fixing the degrees of freedom and dimension, and running 
the time steps from n = 1, ...,N. We then increment dimensions, and finally, degrees of freedom. So u 
is a vector of length NDv. As before, let K be an N x N Gram matrix, formed by evaluating the kernel 
function at all pairs of training inputs. Then the prior p(u\0) is a Gaussian distribution with NDv x NDv 
block diagonal covariance matrix Kb, formed using Dv of the K matrices; if the hyperparameters of 
the kernel function change depending on dimension or degrees of freedom, then these K matrices will be 
different from one another. In short, 

p{u\0)=N{U,K B ). (18) 
With this prior, and the likelihood formulated in terms of the other parameters, we can sample from the 



posterior (14). Sampling from this posterior is difficult, because the Gaussian process function values are 
highly correlated by the Kg matrix. We use Elliptical Slice Sampling (Murray et al. 2010): it has no 



free parameters, jointly updates every element of u, and was especially designed to sample from posteriors 
with correlated Gaussian priors. We found it effective. 



5.2 Sampling the other parameters 

We can similarly obtain distributions over the other parameters. The priors we use will depend on the 



data we are modelling. We placed a vague lognormal prior on and sampled from the posterior ( 15 ) using 



G 



axis aligned slice sampling if 6 was one dimensional, and Metropolis Hastings otherwise. We also used 
Metropolis Hastings to sample from (16), with a spherical Gaussian prior on the elements of L. To sample 
(17), one can use reversible jump MCMC (Green 1995 Robert and Casella 2004 ) . But in our experiments 
we set v = D + 1, and found it effective. Although learning L is not expensive, one might simply wish 
to set it by taking the empirical covariance of the data set, dividing by the degrees of freedom, and then 
taking the lower cholesky decomposition. 



5.3 Making predictions 

Once we have learned the parameters {it, 6, L, i>}, we can find a distribution over S(t*) at a test input i* 
To do this, we must infer the distribution over it* - all the relevant GP function values at t*: 



[till(f*) • ■■U 1D (t*)U2l(U) 
. . . U v l{U) ■ ■ ■ UvD^t*)] 1 . 



(19) 



Consider the joint distribution over u and it*: 



u 




" K B 


A T ' 


It* 


A 





(20) 



Supposing that it* and it respectively have p and q elements, then A is a p x q matrix of covariances between 
the GP function values li* and it at all pairs of the training and test inputs: Aij — fci(t*, ^ m od(iV+i,j)) if 
1 + (i — l)N < j < iN, and otherwise. The kernel function ki may differ from row to row, if it changes 
depending on the degree of freedom or dimension; for instance, we could have a different length-scale for 
each new dimension. I p is a p x p identity matrix representing the prior independence between the GP 
function values in it* 



Conditioning on it, we find 

it* |it -AfiAK^uJp - AK B 1 A T ) . 



(21) 



We can then construct £(i*) using equation (12) and the elements of it*. 



5.4 Likelihood function 

So far we have avoided making the likelihood explicit; the inference procedure we described will work 
with a variety of likelihoods parametrized through a matrix S(i), such as the multivariate t distribution. 
However, assuming for simplicity that each of the variables x(t n ) has a Gaussian distribution, 

aj(t)~jV(/i(t),S(t)), (22) 

then the likelihood is 

A? 

p(d|mW,£W) = l[p(x(t n )\n(t n ),Z(t n )) 

n=l 

N 1 

= [] |2^E(t n )r 1 / 2 exp[--i ( ;(t n ) T S(t n )- 1 i ( ;(t n )], (23) 

n=l 
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where w(t n ) = x(t n ) — /J,(t n ). We can learn a distribution over fi(t), in addition to £(£). Here are three 
possible specifications of fi: 



(24) 



2 = 1 

H(t) = u u +i(t) . 



(25) 
(26) 



In (24), the mean function is directly coupled to the covariance matrix in (12), since they are both 
constructed using the same Gaussian processes. As the components of fi increase in magnitude, so do the 
entries in E. This is a desirable property if we expect, for example, high returns to be associated with 



high volatility. This property is encouraged but not enforced in (25), where a separate vector of Gaussian 
processes u v+ i(t) is introduced into the expression for the mean function, but not the expression for £(£). 
In (26), the mean function is solely this separate vector of Gaussian processes. In each of these cases, we 
can make mean predictions by inferring distributions over the GP function values it, as outlined above. 
And so in each case, the GWP is being used as a GP based regression model which accounts for multiple 
outputs that have changing correlations. 



2008), 


Teh et al. 


(2005) 


24 


)-(26 


1. 


Gelfand et al. 



Boyle and Frean (2004) 



Rather than use a GP based regression, as in 
(120041) combine a spatial Wishart process with a parametric linear regression 
on the mean, to make correlated mean predictions in a spatial setting. Generally their methodology is 
substantially different: the correlation structure is a fixed property of their method (they do not learn 
the parameters of a kernel function), and they are not developing a multivariate volatility model, so are 
not interested in explicitly learning or evaluating the accuracy of the dynamic correlations; in fact, they 
do not explain how to make predictions of E at a test input. Further, they do not sample 9,L, or v, 
and their inference is not explained except that their sampling relies solely on Metropolis Hastings with 
Gaussian proposals, which will not scale to high dimensions, and will not mix efficiently as the strong GP 
prior correlations are not accounted for. They fix L as diagonal, which significantly limits the correlation 
structure of the dynamic covariance matrices (e.g. S(t)), as does fixing 6, which we have empirically found 
to severely affect the quality of predictions. In this paper we focus on making predictions of E(t), setting 
V = 0. 



5.5 Computational complexity 



In contrast to the alternatives, our method scales exceptionally nicely with dimension. MGARCH, the main 
competitor, is limited to 5 dimensions, at which point severe assumptions - such as constant correlations 
- are needed for tractability in higher dimensions. We conjecture that other Wishart process models are 



similarly limited, with Gelfand et al. (2004h restricted to about 2 dimensions. 



Our method is mainly limited by taking the cholesky decomposition of the block diagonal Kb, a NDv x 
NDv matrix. However, chol(blkdiag(A, B, . . . )) = blkdiag(chol(A), chol(B), . . . ). So in the case with 
equal length-scales for each dimension, we only need to take the cholesky of an N x N matrix K, an 
0(N 3 ) operation, independent of dimension! In the more general case with D different length-scales, it 
is an 0(DN 3 ) operation. Taking into account the likelihood, and other operations, the total training 
complexity is either 0(N 3 + vD 2 ) for equal length-scales, or 0(DN 3 + vD 2 ) using separate length-scales 
for each dimension. In the latter more general case, we could in principle go to about 1000 dimensions, 
assuming v is O(D), and for instance, a couple years worth of financial training data, which is typical for 
GARCH (Brownlees et al. 2009). Using sparse GP techniques we could go to even higher dimensions. In 



practice, MCMC may be infeasible for very high D, but we have found Elliptical Slice Sampling incredibly 
robust: we will shortly include a 200 dimensional experiment in an updated version. Overall, this is an 
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impressive scaling - without making further assumptions in our model, we can go well beyond 5 dimensions 
with full generality. 



6 Multivariate GARCH 



We compare predictions of E(£) made by the generalised Wishart process to those made by multivariate 



GARCH (MGARCH), since GARCH (Bollerslev 1986) is extremely popular and arguably unsurpassed 



at predicting the volatility of returns on equity indices and currency exchanges ( Poon and Granger 2005 



Hansen and Lundel 2005 Brownlees et al. 2009) 



Consider a zero mean D dimensional vector stochastic process x(t) with a time changing covariance matrix 



E(t) as in p2|>. In the general MGARCH framework, 

x(«)=S(i) 1 /2 T? (f), 



(27) 



where t/(£) is an i.i.d. vector white noise process with E[r](t)ri(t) T ] = I, and E(i) is the covariance matrix 
of x(t) conditioned on all information up until time t — 1. 



The first and most general MGARCH model, the VEC model of Bollerslev et al. (1988), specifies E t as 



vech(E t ) = A i vech(x t ^ i xJ_ i ) + B J vech(E t _ J -) 

i=l 3=1 



(28) 



Ai and Bj are D(D + l)/2 x D(D + l)/2 matrices of parameters, and ao is a D(D + l)/2 x 1 vector of 
parameters^ The vech operator stacks the columns of the lower triangular part of a D x D matrix into a 
vector of length D(D + l)/2. For example, vech(E) = (E U ,E 21) . . . , E m ,E 22 , ... , E D2 , . . . , T, DD ) T . This 
model is general, but difficult to use. There are (p + q)(D(D + l)/2) 2 + D(D + l)/2 parameters! These 
parameters are hard to interpret, and there are no conditions under which E t is positive definite for all 



t. Gourieroux (1997) discusses the challenging (and sometimes impossible) problem of keeping E t positive 



definite. Training is done by a constrained maximum likelihood, where the log likelihood is given by 



C 



N 

- — log(2.r)- l]T[log|£ t 
t=i 



(29) 



supposing that T] t ~ Af(0,I), and that there are N training points. 



Subsequent efforts have led to simpler but less general models. We can let Aj and Bj be diagonal matrices. 
This model has notably fewer (though still (p+q+l)D(D+l) /2) parameters, and there are conditions under 
which E t is positive definite for all t ( Engle et al. . 1994[ ) . But now there are no interactions between the 



different conditional variances and covariances. A popular variant assumes constant correlations between 
the D components of x, and only lets the marginal variances - the diagonal entries of E(f ) - vary ( Bollerslev 



1990). 



We compare to the 'full' BEKK variant of Engle and Kroner ( 1995 1, as implemented by Kevin Shepphard 
in the UCSD GARCH ToolboxQ We chose BEKK because it is the most general MGARCH variant in 
widespread use. We use the first order model: 



Et = CC T + A T xt-xxJ_ x A + B T ?, t - 1 B , 



(30) 



where A, B and C are D x D matrices of parameters. C is lower triangular to ensure that E t is posi- 
tive definite during maximum likelihood training. For a full review of multivariate GARCH models, see 



Silvennoinen and Terasvirta (2009). 



3 We use and St interchangeably. 

4 http: / /www. kevinshcppard.com/wiki/UCSD_GARCH 
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7 Experiments 



In our experiments, we predict the covariance matrix for multivariate observations x(t) as £(£) = E[E(t)|2>] 



These experiments closely follow ( Brownlees et al. 2009 ) , a rigorous empirical comparison of different 



GARCH models. We use a Gaussian likelihood, as in (231, except with a zero mean function. We make 
historical predictions, and one step ahead forecasts. Historical predictions are made at observed time 
points, or between these points. The one step ahead forecasts are predictions of E(£ + 1) taking into 
account all observations until time £. Historical predictions are important - for example, they can be used 
to understand the nature of covariances between different equity indices during a past financial crisis. 

To make these predictions we learn distributions over the GWP parameters through the Gibbs sampling 
procedure outlined in section 5. The kernel functions we use are solely parametrized by a one dimensional 
length-scale I, which indicates how dependent E(t) and E(£ + a) are on one another. We place a lognormal 
prior on the length-scale, and sample from the posterior with axis-aligned slice sampling. 

For each experiment, we choose a kernel function we want to use with the GWP. We then compare to a 
GWP that uses an Ornstein-Uhlenbeck (OU) kernel function, k(t,t') = exp(— \t — f\/l). Even though we 
are still taking advantage of the inference procedures in the GWP formulation, we refer to this variant of 



GWP as a simple Wishart process (WP), since the classic Bru (1991) construction is like a special case 
of our generalised Wishart process restricted to using a one dimensional Gaussian process with an OU 
covariance structure. 

To assess predictions we use the Mean Squared Error (MSE) between the predicted and true covariance 
matrices, which is always safe since we never observe the true E(£). When the truth is not known, we use 
the proxy Sij(t) = Xi(t)xj(t), to harmonize with the econometrics literature. Xj is the i th component of the 
multivariate observation x(t). This is intuitive because E[xi(£)x.,-(£)] = Ey(t), assuming x(t) has a zero 



mean. In a thorough empirical study, Brownlees et al. (2009) use the univariate analogue of this proxy. 



We do not use likelihood for assessing historical predictions, since that is a training error (for MGARCH), 
but we do use log likelihood (C) for forecasts. Although we give results when a proxy is used for historical 
assessments, this is also a sort of training error, and should be observed with caution. 

We begin by generating a 2 x 2 time varying covariance matrix E p (£) with periodic components, and 
simulating data at 291 time steps from a Gaussian distribution: 

x(t) ~ Af(0,E p (t)) . (31) 

Periodicity is especially common to financial and climate data, where daily trends repeat themselves. 
For example, the intraday volatility on equity indices and currency exchanges has a periodic covariance 
structure. Andersen and Bollerslev (1997) discuss the lack of - and critical need for - models that account 
for this periodicity. In the GWP formulation, we can easily account for this by using a periodic kernel 
function. We reconstruct E p (£) using the kernel fc(£,£') = exp(— 2sin((£ — t') 2 )/l 2 ). We reconstructed the 
historical E p at all 291 data points, and after having learned the parameters for each of the models from 
the first 200 data points, made one step forecasts for the last 91 points. Table [T] and Figure [3] show the 
results. We call this data set PERIODIC. The GWP outperforms the competition on all error measures. It 
identifies the periodicity and underlying smoothness of E p that neither the WP nor MGARCH accurately 
discern: both are too erratic. And MGARCH is especially poor at learning the time changing covariance 
(off-diagonal entry of E p ) in this data set. 

For our next experiment, we predict E(£) for the returns on three currency exchanges - the Canadian to US 
Dollar, the Euro to US Dollar, and the US Dollar to the Great Britain Pound - in the period 15/7/2008- 
15/2/2010; this encompasses the recent financial crisis and so is of particular interest to economists. We 
call this data set EXCHANGE. We use the proxy $y(t) = Xj(£)xj(£). With the GWP, we use the squared 
exponential kernel &(£,£') = exp(— 0.5(£ — t') 2 /l 2 ). We make 200 one step ahead forecasts, having learned 
the parameters for each of the models on the previous 200 data points. We also make 200 historical 
predictions for the same data points as the forecasts. The results are in Table [TJ 
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Table 1: Error for predicting multivariate volatility. 





MSE Historical 


MSE Forecast 


C Forecast 


PERIODIC (2D): 








GWP 


0.0841 


0.118 


-257 


WP 


0.458 


3.04 


-286 


MGARCH 


0.913 


1.95 


-270 


EXCHANGE (3D): 








GWP 


3.49 X 10~ 8 


4.32 X 10~ 8 


2020 


WP 


3.49 X 10~ 8 


6.28 x 10~ 8 


1950 


MGARCH 


3.56 x 10~ 8 


4.45 x 10~ 8 


2050 


EQUITY (5D): 








GWP 


7.01 X 10~ 8 


1.46 X 10- 7 


2930 


WP 


9.89 x 10~ 8 


2.23 x 10~ 7 


1710 


MGARCH 


16.7 x 10~ 8 


7.34 x 10~ 7 


2760 





100 200 300 

Time 
c) 



Figure 3: Reconstructing the historical £ p (t) for the PERIODIC data set. We show the truth (green), and GWP 
(blue), WP (dashed magenta), and MGARCH (thin red) predictions, a) and b) are the marginal variances (diagonal 
elements of ~E p (t)), and c) is the covariance (off-diagonal element of E p (t)). 
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Figure 4: Generating a return series using the empirical covariance of financial data. In a) are daily returns for 
the Toronto Stock Exchange (TSE). In b) is one component of observations generated from a known time- varying 
covariance matrix which emulates the covariance matrix for returns on equity indices. These two plots are not 
supposed to represent the same returns (e.g. we could have chosen NASDAQ instead of the TSE and kept panel 
b) the same), so the same trends will not be present. However, the simulated data has the same type of covariance 
structure as the financial data: it looks like financial data. 



Unfortunately, we cannot properly assess predictions on natural data, because we do not know the true 
E(f). For example, whether we use MSE with a proxy, or likelihood, historical predictions would be 
assessed with the same data used for training. In consideration of this problem, we generated a time 
varying covariance matrix E(£) based on the empirical time varying covariance of the daily returns on five 
equity indices - NASDAQ, FTSE, TSE, NIKKEI, and the Dow Jones Composite - over the period from 
15/2/1990-15/2/2010{^] We then generated a return series by sampling from a multivariate Gaussian at 
each time step using E(i). As seen in FigureJ^J the generated return series behaves like equity index returns. 
This method is not faultless; for example, we assume that the returns are normally distributed. However, 
the models we compare between also make this assumption, and so no model is given an unfair advantage 
over another. And there is a critical benefit: we can compare predictions with the true underlying T,(t). 

To make forecasts and historical predictions on this data set (EQUITY), we used a GWP with a squared 
exponential kernel, k(t, t') = exp(— 0.5(i — t') 2 /l 2 ). We follow the same procedure as before and make 200 
forecasts and historical predictions; results are in Tabic [l] 

Poon and Granger] 2005 



Both the EXCHANGE and EQUITY data sets are especially suited to GARCH 



Hansen and Lunde 



2005| |Brownlees et al.[ [2009} |Bollerslev and Gh^sels] [I~996 



McCullough and Renfro 



1998 Brooks et al. 2001 ). However, the generalised Wishart process outperforms GARCH on both of these 
data sets. Based on our experiments, there is evidence that the GWP is particularly good at capturing 
the co- variances (off-diagonal elements of £(t)) as compared to GARCH. The GWP also outperforms the 
WP, which has a fixed Ornstein-Uhlenbeck covariance structure, even though in our experiments the WP 
takes advantage of the new inference procedures we have derived. Thus the difference in performance is 
likely because the GWP is capable of capturing complex interdependencies in volatility, whereas the WP 
is not. 



5 We define a return as rt = log(P t +i/Pt), where Pt is the price on day t. 
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8 Discussion 



We introduced a stochastic process - the generalised Wishart process (GWP) - which we used to model 
time- varying covariance matrices E(f). Unlike the alternatives, the GWP can easily model a diverse class 
of covariance structures for £(£). In the future, the GWP could be applied to study how these covariance 
matrices depend on other variables, like interest rates, in addition to time. Future research could also 
apply the GWP to extremely high dimensional problems. 

We hope to unify efforts in machine learning and econometrics to inspire new multivariate volatility models 
that are simultaneously general, easy to interpret, and tractable in high dimensions. 
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